Abstract - Development of robust prognostics for digital
electronic system health management will improve device
reliability and maintainability for many industries with products
ranging from enterprise network servers to military aircraft.
Techniques from a variety of disciplines are required to develop an
effective,  robust,  and  technically  sound  health  management 
system  for  digital  electronics.  The  presented  technical  approach 
integrates  collaborative  diagnostic  and  prognostic  techniques 
from  engineering  disciplines  including  statistical  reliability, 
damage  accumulation  modeling,  physics  of  failure  modeling, 
signal  processing  and  feature  extraction,  and  automated 
reasoning  algorithms.  These  advanced  prognostic/diagnostic 
algorithms  utilize  intelligent  data  fusion  architectures  to 
optimally  combine  sensor  data  with  probabilistic  component 
models  to  achieve  the  best  decisions  on  the  overall  health  of 
digital  components  and  systems.  A  comprehensive  component 
prognostic  capability  can  be  achieved  by  utilizing  a  combination 
of  health  monitoring  data  and  model-based  estimates  used  when 
no  diagnostic  indicators  are  present.  Both  board  and  component 
level  minimally-invasive  and  purely  internal  data  acquisition 
methods  will  be  paired  with  model-based  assessments  to 
demonstrate  this  approach  to  digital  component  health  state 
awareness.


Index  Terms —  Automated  reasoning  algorithms,  physics  of 
failure  modeling,  prognostic  and  health  management  (PHM) 

Acronyms 


AF 

Acceleration  Factors 

BIT 

Built-in  Test 

COTS 

Commercial  off-the-shelf 

HASS 

Highly Accelerated Stress Screening

MOSFET

Metal-Oxide-Semiconductor Field-Effect Transistor

MTTF 

Mean  Time  to  Failure 

µP

Microprocessor 

PHM 

Prognostics  and  Health  Management 

PoF 

Physics-of-failure 

RISC 

Reduced  Instruction  Set  Computer 

RUL 

Remaining  Useful  Life 
1.  Introduction

Digital electronic boards are found in numerous facets
of modern-day life, where consumers have come to
depend on their reliability in both
professional and private endeavors. Furthermore, the
commercial and military markets place even greater
reliability demands on semiconductor manufacturers, since a
system failure could produce catastrophic results. Diagnostic
methods have been implemented in a variety of existing
electronic systems (e.g. BIT) and are effective in
identifying sources of malfunctions post-failure within the
system; however, they fail to track system usage throughout the
system’s lifespan, which is necessary when attempting to offer
instantaneous health state assessments. A clear opportunity
and vital need exists to improve digital electronic system
health state awareness and prediction through development of
PHM techniques.

The goal of proactive fault monitoring is to prevent the end
user from experiencing the effects of the failure and, ideally,
to provide advance notice of impending failure in time to
allow corrective measures to be taken prior to failure (e.g.
reduce duty cycle, offload utilization, or schedule repair).
Achieving  this  objective  requires  knowledge  of  how 
component-level  failure  manifests  throughout  the  system  and 
insight  as  to  which  measurands  offer  indication  of  incipient 
signs of failure. In this paper, the authors illustrate how
cradle-to-grave health state awareness can be achieved
by pairing model-based assessments, used in the
absence of fault indications, with a data-driven approach that
tracks indicators of failure and provides failure mode
classification. Test results from accelerated testing of a
CMOS device are presented to demonstrate the ability
to capture fault indicators of impending failure and
track the degradation of performance measurands. The
application of complementary prognostic techniques such as
physics-based component damage accumulation/aging models
based on projected operating conditions, empirical (trending)
models, and system level failure progression models is
discussed as providing a solid foundation on which to develop
verifiable prognostics assessment.

The  ultimate  implementation  of  the  technology  developed 
under  this  program  will  provide  a  comprehensive  and 
effective  diagnostic/prognostic  solution,  requiring  minimal 
sensor  retrofitting  or  hardware  modifications,  suitable  for 
deployment  on  wide-ranging  applications  across  the  multiple 
facets of digital electronics, including integration into digital
electronic boards residing in COTS embedded computer
systems and industrial or commercial computing platforms, or
adding capabilities to automated test equipment.

2.  Device  Reliability 

The reliability throughout the lifespan of a population of
devices is commonly illustrated through the familiar “Bathtub
Curve”, shown in Fig 1. The curve conveys an initially high,
yet decreasing, rate of failure at the start of device life
due to anomalies in the manufacturing process or handling and
installation defects. Manufacturers generally perform “burn-in”
tests or HASS to purge the population of the units prone to
premature failure; doing so advances the remaining
population to the “useful” life stage, where failures still occur,
yet at a low rate that is assumed to be constant. The final phase
of life, referred to as the wearout stage, occurs when
time-dependent environmental, electrical or mechanical stresses age
the physical properties of the device past nominal operating
limits, increasing the likelihood of failure amongst the
remaining population.


Fig  1:  Device  Reliability  illustrated  through  the  “Bathtub” 
Curve 
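
The three phases of the bathtub curve can be pictured numerically. The sketch below is not taken from the paper: it assumes, purely for illustration, that each phase is a Weibull hazard (shape below 1 for early failures, equal to 1 for the constant-rate useful life, above 1 for wearout) and sums the three. The shape and scale values are hypothetical and were chosen only to give the familiar shape.

```python
def weibull_hazard(t, shape, scale):
    """Weibull hazard rate h(t) = (shape / scale) * (t / scale)**(shape - 1), for t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1)


# Hypothetical (shape, scale in hours) pairs; assumptions, not measured values.
PHASES = {
    "early_failure": (0.5, 1_000.0),   # decreasing failure rate
    "useful_life": (1.0, 10_000.0),    # constant failure rate
    "wearout": (5.0, 10_000.0),        # increasing failure rate
}


def bathtub_hazard(t):
    """Total failure rate as the sum of the three phase hazards."""
    return sum(weibull_hazard(t, shape, scale) for shape, scale in PHASES.values())


for hours in (10.0, 2_000.0, 12_000.0):
    print(f"{hours:>8.0f} h  {bathtub_hazard(hours):.2e} failures/h")
```

With these placeholder values the total rate is high early on, lowest in mid-life, and rises again as the wearout term takes over.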

Analysis  of  failures  throughout  the  lifespan  identifies 
numerous  failure  mode  possibilities  triggered  by  various 
failure  mechanisms.  A  failure  mechanism  is  defined  as  the 
physical  phenomenon  causing  the  onset  of  failure;  common 
examples  in  mechanical  systems  are  vibration,  corrosion,  high 
friction,  etc.  The  underlying  failure  mechanism  becomes 
evident  to  the  user  through  failure  modes  which  are  tangible 
observations of how the system or device failed; for example,
overheating, unexpected shutdown, and reduced
performance  are  observable  failure  modes.  Commonly,  single 
failure  modes  can  be  attributed  to  multiple  failure 
mechanisms. 

The  approach  presented  herein  identifies  specific  failure 
mechanisms  prevalent  in  triggering  failure  throughout  the 
“useful”  and  wearout  stages  of  life.  Accelerated  aging 
techniques  were  selected  and  applied  to  test  articles  to 
increase  the  likelihood  of  failure  due  to  the  desired  failure 
mechanism.  The  devices  tested  were  verified  as  successfully 
passing  “burn-in”  procedures  performed  by  the  manufacturer 
aimed  at  decreasing  the  likelihood  of  early  onsets  of  failure, 
thereby  shifting  the  sampled  population  towards  the  “useful” 
life  stage.  Fig  2  illustrates  the  concept  of  multiple  sources  of 
failure  modes,  randomly  distributed  in  time  and  normally 
distributed  in  contribution  to  failure  rate,  vertically 
amalgamating  to  the  constant  failure  rate  assumed  throughout 
normal  life.  Through  selectively  applying  accelerated  aging 
techniques,  targeting  individual  underlying  failure 
mechanisms,  individual  failure  modes  may  be  investigated  by 
pushing  the  device  towards  the  wearout  phase  of  life  enabling 
observation  of  system  level  responses  and  performance 
degradation  as  the  end  of  life  approaches.  These  characteristic 
changes  as  the  device  transitions  from  useful  life  to  end  of  life 
are  of  most  interest  when  attempting  to  identify,  classify,  and 
track  incipient  signs  of  impending  failure. 


Fig  2:  Contribution  of  Multiple  Failure  Modes  to  Device 
Reliability 

3.  Digital  Device  Failure  Mechanisms 

The  realm  of  digital  devices  is  vast,  spanning  devices  from 
FPGAs  and  DSPs  to  general  purpose  processors  and  certain 
forms  of  volatile  and  non-volatile  memory.  Despite  functional 
and  topological  dissimilarities,  all  digital  devices  share  a 
common  functional  dependence  on  semiconductor  devices, 
specifically  transistors.  Moreover,  MOSFETs  are  ubiquitous 
in  digital  electronics  accounting  for  close  to  99%  of  the  FET 
market  [1].  Thus,  understanding  the  physics-of-failure  at  the 
transistor device level is paramount when attempting to
quantify failure modes and mechanisms within
digital  systems. 

Devices  are  continually  subjected  to  aging  through  electrical, 
mechanical  and  environmental  stresses  inherent  to  operational 
conditions throughout a given lifetime. Development of
physics-based diagnostic and prognostic analyses of the
device or system health status is possible if the
time-dependent effects of these aging processes can be
determined. Documented semiconductor PoF models can be
used  as  a  basis  to  derive  system  level  models  describing  the 
tendencies  and  responses  of  the  system  as  it  reacts  over  time 
to  the  environmental  conditions  present.  There  are  four  main 
semiconductor  failure  mechanisms  that  contribute  to  aging 
tendencies  of  MOSFET  devices: 


1.  Thermal cycling

2.  Electromigration 

3.  Hot  carrier  effects 

4.  Time-dependent  dielectric  breakdown 

The  vast  majority  of  semiconductor  devices  are  based  on 
silicon  fabrication;  however  the  following  failure  mechanisms 
can  be  extended  to  other  materials  such  as  silicon-germanium, 
gallium  arsenide,  and  silicon  carbide  providing  a  powerful 
foundation  to  analyze  virtually  all  digital  devices. 


A.  Thermal  Cycling 

Thermal  cycling  is  one  of  the  main  environmental 
acceleration  factors  that  produce  MOSFET  aging.  Device 
degradation  occurs  because  thermal  cycling  deteriorates  the 
thermal  circuit  which  allows  the  device  to  release  generated 
heat. 

When  a  device  composed  of  multiple  materials,  such  as  an  IC, 
is  exposed  to  the  stress  of  thermal  cycling,  it  deteriorates  until 
a  fracture  or  void  space  is  produced  (see  Fig  3).  Dissimilar 
materials  are  used  to  produce  the  heat  transfer  path  to  release 
heat  generated  by  a  functioning  semiconductor  device.  In 
general,  these  materials  have  different  coefficients  of  thermal 
expansion  that  make  the  device  more  susceptible  to  cracks  or 
fractures  due  to  the  forces  originated  by  thermal  expansion 
and  contraction.  These  fractures  among  different  materials 
deteriorate  the  functionality  of  the  device,  but  do  not  directly 
interfere  with  the  software  operation  of  the  device. 


Fig  3:  Void  Area  Creation  Process  Due  to  Thermal  Cycling 

The cracks caused by thermal cycling compromise the
semiconductor’s ability to transfer heat. The reduction in
conduction ability does not destroy the semiconductor itself,
but accelerates the aging process for other failure
mechanisms.

Damage  caused  by  thermal  cycling  accumulates  every  time  a 
device  experiences  a  power-up  and  a  power-down  cycle. 
Thermal  cycling  eventually  weakens  metallic  contacts, 
triggering  the  occurrence  of  gate-oxide  breakdown  or  contact 
migration.  The  Coffin-Manson  model,  shown  below,  can  be 
used  to  estimate  the  number  of  thermal  cycles  before  failure 
for  a  specific  device. 


$$N_f = C_0 \, (\Delta T - \Delta T_0)^{-q} \qquad (1)$$

$N_f$ = Number of cycles to failure
$C_0$ = Material-dependent constant
$q$ = Material-dependent constant (see Table 1)
$\Delta T$ = Entire temperature cycle
$\Delta T_0$ = Cycle in the plastic region


Table 1: Coffin-Manson Coefficients

Material                                 q
Ductile metal (solder)                   1-3
Hard metal alloys (Al-Au)                3-5
Brittle fracture (Si and dielectrics)    6-9
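
As an illustration of how the Coffin-Manson relation and the exponents in Table 1 might be applied, the following minimal sketch evaluates the cycles-to-failure expression and the ratio of cycles to failure between a use condition and an accelerated test condition. The function names and all numeric inputs are hypothetical; the constant C0 and the plastic-region term would have to be determined for a specific device.

```python
def cycles_to_failure(delta_t, delta_t0, c0, q):
    """Coffin-Manson estimate: N_f = C0 * (dT - dT0)**(-q)."""
    effective_swing = delta_t - delta_t0
    if effective_swing <= 0:
        raise ValueError("temperature swing must exceed delta_t0")
    return c0 * effective_swing ** (-q)


def thermal_cycling_af(delta_t_test, delta_t_use, delta_t0, q):
    """Ratio N_f(use) / N_f(test); the constant C0 cancels out."""
    return ((delta_t_test - delta_t0) / (delta_t_use - delta_t0)) ** q


# Hypothetical example: solder-like exponent q = 2, test swing twice the use swing.
print(thermal_cycling_af(delta_t_test=100.0, delta_t_use=50.0, delta_t0=0.0, q=2.0))
```

Because the exponent q grows from ductile to brittle materials, the same increase in temperature swing shortens the cycle life of brittle structures far more than that of solder.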

B.  Electromigration 

Electromigration  is  the  mass  transport  of  the  metal  due  to 
momentum  transfer  between  the  conducting  electrons  and  the 
diffusing  metal  atoms.  This  phenomenon  was  observed  and 
defined  in  metals,  but  can  also  be  related  to  highly  doped 
semiconductors  (with  negative  thermo  impedance).  The  50th 
percentile  time  to  failure  due  to  electromigration  is  calculated 
using  the  equation  given  below. 


$$t_{50} = A_0 \, J^{-2} \, e^{E_a/(kT)} \qquad (2)$$

$t_{50}$ = 50th percentile time to failure
$A_0$ = Constant
$J$ = Current density
$E_a$ = -0.1 to 0.2 eV
$k$ = Boltzmann’s constant
$T$ = Temperature in K
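
A short sketch of how an expression of this form (a current-density power law multiplied by an Arrhenius term) translates into an acceleration factor between a use condition and a stress condition. The constant A0 cancels in the ratio. The function name, the default exponent of 2, and the activation energy passed in the example are illustrative assumptions, not values validated for any particular device.

```python
import math

BOLTZMANN_EV_PER_K = 8.617e-5  # Boltzmann's constant in eV/K


def electromigration_af(j_use, j_stress, t_use_k, t_stress_k, ea_ev, n=2.0):
    """Ratio t50(use) / t50(stress) for t50 = A0 * J**(-n) * exp(Ea / (k * T))."""
    current_term = (j_stress / j_use) ** n
    thermal_term = math.exp(
        ea_ev / BOLTZMANN_EV_PER_K * (1.0 / t_use_k - 1.0 / t_stress_k)
    )
    return current_term * thermal_term


# Hypothetical example: doubling the current density at a fixed temperature.
print(electromigration_af(1.0, 2.0, 350.0, 350.0, ea_ev=0.2))
```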

C.  Hot  Carrier  Effects 

As  MOSFETs  begin  to  age,  the  dielectric  material  of  the 
device begins to degrade. The silicon dioxide (SiO2) bonds of
the dielectric break as a result of interaction with high-energy
carriers, also known as hot carriers. This
phenomenon  is  very  important  in  MOSFET  technology  where 
the  presence  of  high  electric  fields  facilitates  the  creation  of 
hot  carriers,  as  shown  in  Fig  4. 
The four common hot carrier injection mechanisms are [2]:

1.  Drain avalanche hot carrier injection (DAHC)

2.  Channel  hot  electron  injection 

3.  Substrate  hot  electron  injection 

4.  Secondary  generated  hot  electron  injection 

Drain  Avalanche  Hot  Carrier  (DAHC):  This  phenomenon 
produces  the  most  accelerated  device  degradation  under 
normal  operating  temperatures.  This  occurs  when  the  voltage 
applied  at  the  drain  under  non-saturated  conditions  is  higher 
than  the  voltage  applied  to  the  gate  (Vd>Vg).  High  electric 
fields  found  near  the  drain  accelerate  the  carriers  into  the 
drain's  depletion  region. 

Acceleration of the channel carriers: This phenomenon,
also known as impact ionization, occurs when the accelerated
carriers collide with Si lattice atoms, creating electron-hole
pairs in the process. The displaced electron-hole pairs could gain
enough energy to overcome the electric potential barrier
between the silicon substrate and the gate oxide, producing
gate isolation deterioration. This leads to an increase in the
gate current and a reduction in the threshold voltage (Vth).


Substrate  hot  electron  injection:  Due  to  the  influence  of  the 
drain-to-gate  field,  hot  carriers  are  generated  in  the  substrate. 
These  hot  carriers  are  injected  and  become  trapped  in  the  gate 
oxide  layer,  causing  the  same  degradation  as  DAHC. 


Secondary  generated  hot  electron  injection:  The  number  of 
electrons  that  become  trapped  in  the  interface  between  doped 
regions  grows  over  time  modifying  the  threshold  voltage  (Vth) 
and  its  conductance  (gm). 


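A median time-to-failure approximation for hot carrier
injection is:

$$t_{50} = B_0 \, I^{-n} \, e^{E_a/(kT)} \qquad (3)$$

$B_0$ = Constant
$n$ = 2-4
$E_a$ = -0.1 to 0.2 eV
$k$ = Boltzmann’s constant
$T$ = Temperature in Kelvin (K)
$I$ = Peak substrate current (N-channel) or peak gate current (P-channel)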


An example of SiO2 progressive breakdown in a MOSFET
is shown in Fig 5.


Fig 5: Gate Current Increase in an Accelerated MOSFET
Aging Test


D.  Time-Dependent  Dielectric  Breakdown

In general, time-dependent dielectric breakdown relates to the
SiO2 oxidation barrier deterioration under normal operating
conditions. The reduction in life can be computed as [2]:

$$t_{50} = B_0 \left(\frac{1}{V}\right)^{(a - bT)} e^{\frac{X + Y/T + ZT}{kT}} \qquad (4)$$

$B_0$ = Constant
$a$ = 78
$b$ = -0.081
$X$ = 0.759 eV
$Y$ = -66.8 eV·K
$Z$ = 8.37E-4 eV/K
$k$ = Boltzmann’s constant
$T$ = Temperature in Kelvin (K)
$V$ = Gate voltage
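
To see how strongly the voltage exponent dominates this kind of model, the sketch below compares median life at two operating points. It assumes the form t50 = B0 (1/V)^(a - bT) exp((X + Y/T + ZT)/(kT)) with the coefficient values listed above; working with the logarithm avoids overflow, and B0 cancels in the ratio. The function names and the example operating points are hypothetical.

```python
import math

BOLTZMANN_EV_PER_K = 8.617e-5  # Boltzmann's constant in eV/K

# Coefficients as listed with equation (4).
A_COEF, B_COEF = 78.0, -0.081
X_COEF, Y_COEF, Z_COEF = 0.759, -66.8, 8.37e-4


def tddb_log_t50(gate_voltage, temp_k, log_b0=0.0):
    """Natural log of t50; log_b0 is an unknown device constant (assumed 0 here)."""
    voltage_term = (A_COEF - B_COEF * temp_k) * math.log(1.0 / gate_voltage)
    thermal_term = (X_COEF + Y_COEF / temp_k + Z_COEF * temp_k) / (
        BOLTZMANN_EV_PER_K * temp_k
    )
    return log_b0 + voltage_term + thermal_term


def tddb_life_ratio(v_ref, t_ref_k, v_new, t_new_k):
    """t50(new condition) / t50(reference condition); B0 cancels."""
    return math.exp(tddb_log_t50(v_new, t_new_k) - tddb_log_t50(v_ref, t_ref_k))


# Hypothetical example: raising the gate voltage from 1.3 V to 1.6 V at 373 K.
print(tddb_life_ratio(1.3, 373.0, 1.6, 373.0))
```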


Fig 4: MOSFET Cross-sectional Visualization of Hot Carrier Effect

Independent of their origin, hot carriers produce two types of
deterioration in FET technologies. The first is acceleration in
time-dependent electrical breakdown of the oxide barrier
(SiO2), and the second is migration and degradation of the
semiconductor.


These  PoF  mechanisms  serve  as  the  basis  for  the  accelerated 
aging  processes  performed  on  the  test  devices. 

4.  Target  Device  Selection

An appropriate digital component or device was needed as the
focus for PHM technology development. A range of
digital component categories was considered for evaluation:

■  Digital  Signal  Processor 

■  Microprocessor 

■  Microcontroller 

■  Field  Programmable  Gate  Array 

■  Application  Specific  Integrated  Circuit 

■  Static/Dynamic  Random  Access  Memory 
(SRAM/DRAM) 


While each of these digital component categories typically
serves different functional purposes, they are structurally very
similar. With the transistor as the common
denominator, the component’s function may have a greater
influence  on  its  susceptibility  to  faults  than  the  actual 
architecture.  For  example,  memory  devices  are  regularly 
implemented  with  built-in  error  checking  for  a  certain  level  of 
fault tolerance; FPGAs often run massively parallel,
independent  operations  where  a  fault  to  a  single  element  may 
have  a  negligible  impact  on  the  operation  of  the  entire 
component.  Microprocessors  are  a  category  of  digital 
components  that  may  be  more  susceptible  to  faults  due  to  their 
complexity,  large  scale,  and  generally  demanding  role  and 
responsibilities.  These  characteristics  suggest  a  more 
significant  risk  associated  with  an  undiagnosed  fault  in  a 
microprocessor,  and  a  greater  need  for  effective  PHM.  In 
consideration  of  this  information,  the  microprocessor  was 
selected  as  the  focus  of  development. 

A  desire  to  use  commercially  available  products  meeting 
preferred  testing  parameters  resulted  in  identification  of 
Genesi’s  Pegasos  PowerPC  computing  platform.  The 
PegasosPPC  utilizes  a  360  CBGA  MPC7447  processor  on  an 
affordable  and  completely  removable  edge  card  which  inserts 
into  a  fully  populated  motherboard  (see  Fig  6).  The  MPC7447 
host  processor  is  a  high-performance,  low-power  32-bit 
implementation  of  the  PowerPC  RISC  architecture  with  a  full 
128-bit  implementation  of  Freescale's  AltiVec™  technology 
[7].  It  has  a  robust  data  processing  core  incorporating  a 
powerful  128-bit  vector  processing  unit,  double-precision 
floating-point  arithmetic  unit,  superscalar  data  bus 
architecture,  and  sizable  on-chip  L2  cache  memory.  The 
capabilities of the MPC7447 are representative of processors
commonly used in military, commercial and private digital
systems, making it an ideal starting point for digital PHM
development.


Fig 6: Processor Test Platform

One  attractive  feature  of  this  product  lies  in  the  separability  of 
the  processor  from  its  supporting  circuitry  (e.g.  north/south 
bridge,  interface  controllers  and  memory).  The  power  filtering 
components  accompanying  the  processor  on  the  edge  card, 
shown  in  Fig  6,  were  easily  removed  and  replicated  on  an 
intermediary  board.  Isolating  the  processor  in  such  a  manner 
facilitated  a  primary  objective  of  focusing  accelerated  aging 
exclusively  on  the  processor  itself  thereby  increasing  the 
likelihood  that  end-of-life  system  failure  would  originate 
within  the  processor. 


5.  Accelerated  Failure  Testing 

Operating conditions that exceed specified conditions are
commonly referred to as acceleration factors. Electrical AF
depend on device parameters such as voltage and current,
whereas mechanical AF depend on the geometry or packaging
of a device, or the mechanical stress on solder joints and
metal interconnects. The remaining AF depend on
environmental conditions, such as ambient temperature,
external electromagnetic interference, humidity, or radiation,
and are known as environmental AF. The accelerated failure
testing performed introduced specific electrical and
mechanical stresses into the system. By subjecting the µP to
elevated operating conditions outside the specified operating
range, accelerated aging rapidly advanced the µP through its
normal operational lifespan to the wearout phase, ultimately
leading to observable failure.


The  following  three  variants  of  accelerated  tests  were 
performed: 

1.  Thermal Oscillation

2.  Combinational  Environment 

3.  Thermo-Electrical  Stress 

A.  Thermal  Oscillation 

A µP daughtercard was placed in a programmable thermal
chamber (see Fig 7) where the temperature was oscillated
between preset limits with a one-hour cycle time to maximize
the number of cycles performed per day. The temperature
extremes employed for thermal cycling tests were extended
beyond the minimum and maximum specified device storage
temperatures (-55°C and +150°C) [7]. Baseline tests were
conducted before and after cycling due to the impracticality of
operation with the motherboard while in the chamber.


Fig  7:  Environmental  Test  Chamber  used  for  Thermal 
Oscillation  and  Thermal  with  Vibration 


B.  Combinational  Environment 

A µP daughtercard was secured to a custom vibration fixture
(see Fig 8) connected to an electro-dynamic shaker and
installed into the thermal chamber. The µP daughtercard was
subjected to a fixed-frequency sine wave vibration, empirically
selected to induce maximum response (i.e. natural
frequency), while simultaneously being subjected to identical
thermal cycling as described above.


Fig  8:  Internal  View  of  Environmental  Test  Chamber 


C.  Thermo-Electrical  Stress 

The power supplied to the µP under test was switched from
the motherboard ATX supply to a variable linear benchtop
supply. A µP daughtercard was operated normally with the
motherboard: the variable external supply provided the
necessary core power to the processor, while the motherboard
continued to source power from the ATX supply. By isolating
the processor power from the motherboard power, control
over the aging profile of the processor was achieved. Once
nominal operating conditions were established, the voltage
was increased, allowing the µP to operate at voltages beyond
operational limits. The initial voltage for the thermo-electrical
accelerated failure test was selected to coincide with the
maximum operating voltage and temperature (1.6V and
100°C, respectively) specified by [7]. The standard heat sink
was removed and a 120 cfm fan was focused on the µP to
maintain a nominal temperature of 100°C as subsequent trials
progressed. Separate trials were conducted with the die
temperature restricted and unrestricted as the core voltage was
increased.


6.  Testing  Results

Impact developed a suite of test algorithms in a Linux
operating environment to provide baseline tests prior to and
after each accelerated aging process. The test suite allowed
independent and simultaneous exercise of each execution unit
present within the µP (e.g. simple ALU, complex ALU,
AltiVec unit), enabling analysis of functional degradation
of individual operational sectors and total processor
utilization. To ensure complete processor operation, the
instruction fetch, memory and load/store sectors were
inherently accessed when exercising execution units. In
addition, the test suite provided vital control over loop
iterations and number of runs, allowing optimization of run
time for each unit and ensuring measurable performance results.
Additional software was developed in the LabVIEW
programming environment to measure, record and analyze test
results (see Fig 9).


Fig 9: Data Acquisition, Logging, and Analysis/Test Suite
Interface


A.  Data Analysis

The analysis of test data from the myriad of tests performed
showed a distinct ability to identify and capture incipient
signs of failure prior to functional failure of the system.
Moreover, the varied accelerated aging processes illustrated
discernible trends in degradation progression, as shown by a
comparison of Fig 10 and Fig 11, underwriting the ability to
identify individual modes of failure and develop effective
PHM techniques for digital systems.


Fig 10: Feature Tracking of Thermo-Electrical Aging Process

The baseline measurements taken after successive
Thermo-Electrical aging processes revealed an exponentially
increasing trend in the highest fidelity feature measurement
(see Fig 10). Each aging cycle escalated the core processor
voltage, thereby increasing the electric field between the gate
and substrate region in addition to increasing the electron
mobility through the drain-to-source channel. The
combination of these phenomena accelerated/excited electrons
to the point of becoming trapped within the silicon dioxide
(SiO2) dielectric. This testing procedure was deemed effective
in accelerating and analyzing failure modes associated with
electrical failure mechanisms, such as electromigration, hot
carrier effect, and time-dependent dielectric breakdown (TDDB).
Fig 11: Feature Tracking of Thermal Oscillation & Vibration
Aging Process

The compiled data extracted from each baseline test
performed after every 24-hour thermal oscillation with
vibration (i.e. 24 thermal cycles/test) identified a remarkably
different degradation profile (see Fig 11) from that of the
Thermo-Electrical aging process. The highest fidelity feature
measurement specific to vibratory and thermal stress factors
indicated a linear degradation as the processor aged over time.
Overall, the measured data supports the proposed theory of
solder joint fatigue and void area creation: as the processor
under test is subjected to increasing vibratory stress, the
interconnects and solder joints begin to fatigue, triggering
distinct failure modes separable from those identified through
electrical AF.
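
The contrast between an exponential and a linear degradation profile suggests a simple way to separate the two automatically. The following sketch is an illustrative assumption rather than the analysis used by the authors: it fits both a straight line and an exponential (through a straight-line fit to the logarithm of the feature) and reports whichever leaves the smaller squared error. The feature values are synthetic.

```python
import math


def _fit_line(t, y):
    """Ordinary least-squares slope and intercept for y ~ slope * t + intercept."""
    n = len(t)
    mean_t, mean_y = sum(t) / n, sum(y) / n
    s_tt = sum((ti - mean_t) ** 2 for ti in t)
    slope = sum((ti - mean_t) * (yi - mean_y) for ti, yi in zip(t, y)) / s_tt
    return slope, mean_y - slope * mean_t


def classify_trend(t, y):
    """Return 'linear' or 'exponential' for a strictly positive feature history."""
    slope, intercept = _fit_line(t, y)
    sse_linear = sum((yi - (slope * ti + intercept)) ** 2 for ti, yi in zip(t, y))

    growth, log_scale = _fit_line(t, [math.log(yi) for yi in y])
    sse_exponential = sum(
        (yi - math.exp(log_scale + growth * ti)) ** 2 for ti, yi in zip(t, y)
    )
    return "exponential" if sse_exponential < sse_linear else "linear"


# Synthetic feature histories, one sample per aging cycle.
cycles = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
print(classify_trend(cycles, [2.0 * math.exp(0.5 * c) for c in cycles]))
print(classify_trend(cycles, [3.0 + 2.0 * c for c in cycles]))
```

A practical classifier would also need to handle measurement noise and features that can be zero or negative, for which the logarithmic fit is undefined.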

The results from baseline tests conducted after each series of
Thermal Oscillation showed no measurable effect on the
overall health of the processor. The DUT was subjected to
over 400 hours of temperature oscillations beyond maximum
device ratings with no discernible effect. Although Thermal
Oscillation was deemed the least efficient method of
accelerated aging for evoking rapid damage within the
processor in the allotted time frame, Impact recognizes that
thermal cycling, along with other environmental stress factors,
plays an important role in digital systems and is continuing to
pursue this type of failure progression in ongoing
development endeavors.

B.  Representative  Life  Consumption  Assessment 

The data acquired during Thermo-Electrical accelerated life
testing supports the use of the Hot Carrier Effect failure
mechanism to model damage accumulated in the µP. A
derivation based on the associated MTTF approximation
yielded a model that effectively accounts for the life of the µP
consumed as a result of time spent operating at increased
temperatures. It is reasonable to expect, and indeed was
demonstrated in testing, that operating a µP at temperatures
significantly above those recommended reduces the life of the
µP at vastly accelerated rates.

A representative model with assumed coefficients and actual
test data provides an observable representation of the effects
of temperature on an operating processor. The histogram
displayed in Fig 12 represents the entire life of a sample µP.
The vertical bars show the amount of time the µP was
operated at discrete temperatures. It can be observed that
extensive operation at low temperature has a largely
insignificant effect on the total life of the unit. Brief periods
of operation at increasingly higher temperatures consume
larger and larger fractions of the unit’s total life.


Fig 12: RUL Assessment Based Upon Modeled Feature Data
(plot title: Representative Life Consumption Model)

Further testing is required to empirically determine accurate
values for the assumed coefficients and to provide a distribution of
lifetimes to use as an estimate of total life.
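
The bookkeeping behind a temperature histogram of this kind can be sketched as follows. This is a minimal illustration, not the authors' model: it assumes an Arrhenius-type temperature dependence for life and a linear damage sum, in which each hour at a given temperature consumes 1/life(T) of the total. The reference life, reference temperature, activation energy and histogram are all hypothetical placeholders.

```python
import math

BOLTZMANN_EV_PER_K = 8.617e-5  # Boltzmann's constant in eV/K


def life_at_temperature(temp_k, ref_life_h, ref_temp_k, ea_ev):
    """Expected life in hours at temp_k, scaled from a reference point."""
    return ref_life_h * math.exp(
        ea_ev / BOLTZMANN_EV_PER_K * (1.0 / temp_k - 1.0 / ref_temp_k)
    )


def life_consumed(hours_by_temp_k, ref_life_h, ref_temp_k, ea_ev):
    """Fraction of total life used, summed linearly over the histogram bins."""
    return sum(
        hours / life_at_temperature(temp_k, ref_life_h, ref_temp_k, ea_ev)
        for temp_k, hours in hours_by_temp_k.items()
    )


# Hypothetical usage histogram: temperature in K mapped to hours of operation.
usage = {320.0: 5_000.0, 350.0: 500.0, 380.0: 20.0}
used = life_consumed(usage, ref_life_h=50_000.0, ref_temp_k=320.0, ea_ev=0.7)
print(f"life consumed: {used:.1%}, remaining: {1.0 - used:.1%}")
```

With a positive activation energy, an hour in the hottest bin consumes far more life than an hour in the coolest bin, matching the qualitative behavior described above.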

7.  Path  Forward 

In addition to developing system-level physics-of-failure
models, Impact Technologies is leveraging an existing
diagnostic/validation technology termed SignalPro, which is
capable of learning the relationships among an arbitrary set
of inputs (be they features or raw sensor values) to evaluate a
digital board and its components at a system level. SignalPro
represents a data-driven condition monitoring approach to
diagnostics, with prognostics provided by trending.

Impact  Technologies’  SignalPro  analysis  engine  offers  a 
system  monitoring  approach  that  can  be  used  to  evaluate 
electronics  system  performance  by  employing  a  combination 
of  signal  processing,  statistics,  and  data-driven  modeling 
techniques.  A  complete  SignalPro  system  model  is  created  by 
evaluating  “healthy”  data  during  a  process  called  training.  The 
generated  model  captures  the  interrelationships  among  sensor 
readings  or  extracted  features.  During  this  training  period, 
signal  preprocessing  is  performed  and  the  signal  relationships 
and  acceptable  deviations  are  quantified. 

Previously  acquired  historical  data  are  captured  and  sent  to 
the  training  engine,  which  finds  the  most  efficient  system 
representation.  Statistical  and  correlation-based  features 
extracted  from  these  data  further  characterize  the  individual 
signal  behaviors.  Finally,  an  empirical  system  model  is 
created  that  captures  these  interrelationships  and  the  accepted 
deviations. 
During monitoring, real-time data is used with a prediction
model to assess whether the system is operating within
acceptable limits. The model creates an estimate of the
expected sensor values based on relationships between the
new measurements and historical data. These estimates are
compared to the actual data streaming in from the system,
generating a residual signal. This residual signal is further
analyzed to reveal unexpected (and potentially faulty)
conditions.
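
The train-then-monitor pattern described above can be illustrated with a deliberately small stand-in. The sketch below is not SignalPro and does not represent its algorithms: it assumes a single input signal, a straight-line relationship learned from healthy data, and a fixed three-sigma limit on the residual. All names and sample values are hypothetical.

```python
from statistics import mean, pstdev


def train(inputs, outputs):
    """Learn a line from healthy data and the spread of its residuals."""
    mean_x, mean_y = mean(inputs), mean(outputs)
    s_xx = sum((x - mean_x) ** 2 for x in inputs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(inputs, outputs)) / s_xx
    intercept = mean_y - slope * mean_x
    residuals = [y - (slope * x + intercept) for x, y in zip(inputs, outputs)]
    return slope, intercept, pstdev(residuals)


def is_anomalous(model, x, y, n_sigma=3.0):
    """Flag a sample whose residual exceeds n_sigma training deviations."""
    slope, intercept, sigma = model
    residual = y - (slope * x + intercept)
    return abs(residual) > n_sigma * sigma


# Hypothetical healthy training data, e.g. load level versus a measured feature.
healthy_x = [1.0, 2.0, 3.0, 4.0, 5.0]
healthy_y = [2.1, 3.9, 6.2, 7.8, 10.1]
model = train(healthy_x, healthy_y)

print(is_anomalous(model, 3.0, 6.0))  # close to the learned relationship
print(is_anomalous(model, 3.0, 9.0))  # far from the learned relationship
```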


Fig 13: System Health Assessment and RUL Analysis


In addition to the data-driven approach provided by the
SignalPro analysis engine, PHM algorithms are being created
for critical components within the system. These provide a
system-level model, incorporating usage-based monitoring, that
adds prognostic assessments. The health assessments provided by
each of these independent paths can then be fused by a
system-level reasoner to provide a high-confidence analysis of the
health and RUL of the electronic system, as illustrated by Fig
13.


8.  Conclusion 

The authors have shown the distinct ability to capture
fault-to-failure progression data through a series of accelerated aging
tests designed to isolate and increase the likelihood of failure
due to specific known failure mechanisms. The resulting
failure modes were quantified through minimally invasive
monitoring of system feature data as the device degraded over
time. The developed understanding of semiconductor device
failure and the ability to measure and trend such shifts in
performance indicate the potential to develop prognostic
health monitoring techniques for a wide breadth of digital
components and systems.

The achievements discussed have made the first steps towards
a prognostic ability for digital electronics; however, there is
considerable work left ahead. Ongoing development of
prognostic modeling algorithms paired with data-driven
analysis, fused with reasoning methodologies, offers a viable
avenue to obtain predictive insight into digital system
reliability and bring PHM for digital electronics to fruition.